AITopics | speech rate

speech rate

Automated evaluation of children's speech fluency for low-resource languages

Zhang, Bowen, Latiff, Nur Afiqah Abdul, Kan, Justin, Tong, Rong, Soh, Donny, Miao, Xiaoxiao, McLoughlin, Ian

arXiv.org Artificial Intelligence, Oct-24-2025

Assessment of children's speaking fluency in education is well researched for majority languages, but remains highly challenging for low-resource languages. This paper proposes a system to automatically assess fluency by combining a fine-tuned multilingual ASR model, an objective metrics extraction stage, and a generative pre-trained transformer (GPT) network. The objective metrics include phonetic and word error rates, speech rate, and speech-pause duration ratio. These are interpreted by a GPT-based classifier guided by a small set of human-evaluated ground truth examples, to score fluency. We evaluate the proposed system on a dataset of children's speech in two low-resource languages, Tamil and Malay, and compare the classification performance against Random Forest and XGBoost, as well as using ChatGPT-4o to predict fluency directly from speech input. Results demonstrate that the proposed approach achieves significantly higher accuracy than multimodal GPT or other methods.
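The objective metrics named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `(token, start, end)` timestamp format, and the exact definitions of speech rate (words per second) and speech-pause ratio (speaking time divided by pause time) are assumptions made for illustration. A phonetic error rate would follow the same edit-distance computation over phoneme sequences instead of words.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance over whitespace tokens, divided by reference length."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def timing_metrics(words: List[Tuple[str, float, float]]) -> dict:
    """Speech rate and speech-pause ratio from (token, start_s, end_s) tuples.

    Assumed definitions (hypothetical, not taken from the paper):
    speech rate = words per second over the whole utterance;
    speech-pause ratio = total speaking time / total pause time.
    """
    if not words:
        raise ValueError("need at least one timed word")
    total = words[-1][2] - words[0][1]
    if total <= 0:
        raise ValueError("utterance duration must be positive")
    speech = sum(end - start for _, start, end in words)
    pause = total - speech
    return {
        "speech_rate_wps": len(words) / total,
        "speech_pause_ratio": speech / pause if pause > 0 else float("inf"),
    }
```

In a pipeline like the one described, values of this kind would be passed to the classifier as features; how they are formatted for the GPT-based stage is not stated in the abstract.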

large language model, machine learning, natural language, (22 more...)

arXiv.org Artificial Intelligence

doi: 10.21437/Interspeech.2025-1358

2505.19671

Country: Asia (0.30)

Genre: Research Report > New Finding (0.48)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis

Zhang, Ziyu, Li, Hanzhao, Hu, Jingbin, Li, Wenhao, Xie, Lei

arXiv.org Artificial Intelligence, Oct-1-2025

Controllable speech synthesis refers to the precise control of speaking style by manipulating specific prosodic and paralinguistic attributes, such as gender, volume, speech rate, pitch, and pitch fluctuation. With the integration of advanced generative models, particularly large language models (LLMs) and diffusion models, controllable text-to-speech (TTS) systems have increasingly transitioned from label-based control to natural language description-based control, which is typically implemented by predicting global style embeddings from textual prompts. However, this straightforward prediction overlooks the underlying distribution of the style embeddings, which may hinder the full potential of controllable TTS systems. In this study, we use t-SNE analysis to visualize and analyze the global style embedding distribution of various mainstream TTS systems, revealing a clear hierarchical clustering pattern: embeddings first cluster by timbre and subsequently subdivide into finer clusters based on style attributes. Based on this observation, we propose HiStyle, a two-stage style embedding predictor that hierarchically predicts style embeddings conditioned on textual prompts, and further incorporate contrastive learning to help align the text and audio embedding spaces. Additionally, we propose a style annotation strategy that leverages the complementary strengths of statistical methodologies and human auditory preferences to generate more accurate and perceptually consistent textual prompts for style control. Comprehensive experiments demonstrate that when applied to the base TTS model, HiStyle achieves significantly better style controllability than alternative style embedding predicting approaches while preserving high speech quality in terms of naturalness and intelligibility. Audio samples are available at https://anonymous.4open.science/w/HiStyle-2517/.

arxiv preprint arxiv, large language model, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2509.25842

Country: Asia (0.28)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Fast-VGAN: Lightweight Voice Conversion with Explicit Control of F0 and Duration Parameters

Abrassart, Mathilde, Obin, Nicolas, Roebel, Axel

arXiv.org Artificial Intelligence, Jul-8-2025

Precise control over speech characteristics, such as pitch, duration, and speech rate, remains a significant challenge in the field of voice conversion. The ability to manipulate parameters like pitch and syllable rate is an important element for effective identity conversion, but can also be used independently for voice transformation, achieving goals that were historically addressed by vocoder-based methods. In this work, we explore a convolutional neural network-based approach that aims to provide means for modifying fundamental frequency (F0), phoneme sequences, intensity, and speaker identity. Rather than relying on disentanglement techniques, our model is explicitly conditioned on these factors to generate mel spectrograms, which are then converted into waveforms using a universal neural vocoder. Accordingly, during inference, F0 contours, phoneme sequences, and speaker embeddings can be freely adjusted, allowing for intuitively controlled voice transformations. We evaluate our approach on speaker conversion and expressive speech tasks using both perceptual and objective metrics. The results suggest that the proposed method offers substantial flexibility, while maintaining high intelligibility and speaker similarity.

artificial intelligence, deep learning, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2507.04817

Country:

Europe > United Kingdom > England > East Sussex > Brighton (0.04)
Europe > France > Île-de-France > Paris > Paris (0.04)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Analyzing and Improving Speaker Similarity Assessment for Speech Synthesis

Carbonneau, Marc-André, van Niekerk, Benjamin, Seuté, Hugo, Letendre, Jean-Philippe, Kamper, Herman, Zaïdi, Julian

arXiv.org Artificial Intelligence, Jul-4-2025

Modeling voice identity is challenging due to its multifaceted nature. In generative speech systems, identity is often assessed using automatic speaker verification (ASV) embeddings, designed for discrimination rather than characterizing identity. This paper investigates which aspects of a voice are captured in such representations. We find that widely used ASV embeddings focus mainly on static features like timbre and pitch range, while neglecting dynamic elements such as rhythm. We also identify confounding factors that compromise speaker similarity measurements and suggest mitigation strategies. To address these gaps, we propose U3D, a metric that evaluates speakers' dynamic rhythm patterns. This work contributes to the ongoing challenge of assessing speaker identity consistency in the context of ever-better voice cloning systems. We publicly release our code.

artificial intelligence, machine learning, utterance, (16 more...)

arXiv.org Artificial Intelligence

2507.02176

Country:

North America > United States > Pennsylvania (0.04)
North America > Canada > Quebec > Montreal (0.04)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)
(2 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Leisure & Entertainment > Sports > Track & Field (0.40)
Leisure & Entertainment > Sports > Running (0.40)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.67)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.49)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)

CBF-AFA: Chunk-Based Multi-SSL Fusion for Automatic Fluency Assessment

Wade, Papa Séga, Andries, Mihai, Kanellos, Ioannis, Moudenc, Thierry

arXiv.org Artificial Intelligence, Jun-27-2025

Automatic fluency assessment (AFA) remains challenging, particularly in capturing speech rhythm, pauses, and disfluencies in non-native speakers. We introduce a chunk-based approach integrating self-supervised learning (SSL) models (Wav2Vec2, HuBERT, and WavLM) selected for their complementary strengths in phonetic, prosodic, and noisy speech modeling, with a hierarchical CNN-BiLSTM framework. Speech is segmented into breath-group chunks using Silero voice activity detection (Silero-VAD), enabling fine-grained temporal analysis while mitigating over-segmentation artifacts. SSL embeddings are fused via a learnable weighted mechanism, balancing acoustic and linguistic features, and enriched with chunk-level fluency markers (e.g., speech rate, pause durations, n-gram repetitions). The CNN-BiLSTM captures local and long-term dependencies across chunks. Evaluated on Avalinguo and Speechocean762, our approach improves F1-score by 2.8 and Pearson correlation by 6.2 points over single SSL baselines on Speechocean762, with gains of 4.2 F1-score and 4.0 Pearson points on Avalinguo, surpassing Pyannote.audio-based segmentation baselines. These findings highlight chunk-based multi-SSL fusion for robust fluency evaluation, though future work should explore generalization to dialects with irregular prosody.
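The "learnable weighted mechanism" for fusing SSL embeddings can be sketched as a softmax-weighted sum. This is only one plausible reading of the abstract, not the paper's architecture: the function names are hypothetical, the three embeddings are assumed to share one dimension, and the logits that would be learned during training are passed in as fixed numbers here.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scalars."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_embeddings(embeddings: List[List[float]],
                    logits: List[float]) -> List[float]:
    """Weighted sum of per-model embeddings for one chunk.

    `embeddings` holds one vector per SSL model (e.g. Wav2Vec2, HuBERT,
    WavLM), all assumed to have the same dimension. `logits` stands in
    for one trainable scalar per model; softmax turns them into positive
    weights that sum to one.
    """
    if len(embeddings) != len(logits):
        raise ValueError("need exactly one logit per embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must share the same dimension")
    weights = softmax(logits)
    return [sum(w * e[d] for w, e in zip(weights, embeddings))
            for d in range(dim)]
```

With equal logits the fusion reduces to a plain average of the three vectors; training would shift the weights toward whichever model is most useful for the task.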

artificial intelligence, machine learning, speech recognition, (19 more...)

arXiv.org Artificial Intelligence

2506.20243

Country: Europe > France (0.16)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.48)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.34)

Improving Neural Diarization through Speaker Attribute Attractors and Local Dependency Modeling

Palzer, David, Maciejewski, Matthew, Fosler-Lussier, Eric

arXiv.org Artificial Intelligence, Jun-9-2025

In recent years, end-to-end approaches have made notable progress in addressing the challenge of speaker diarization, which involves segmenting and identifying speakers in multi-talker recordings. One such approach, Encoder-Decoder Attractors (EDA), has been proposed to handle variable speaker counts as well as better guide the network during training. In this study, we extend the attractor paradigm by moving beyond direct speaker modeling and instead focus on representing more detailed 'speaker attributes' through a multistage process of intermediate representations. Additionally, we enhance the architecture by replacing transformers with conformers (convolution-augmented transformers) to model local dependencies. Experiments demonstrate improved diarization performance on the CALLHOME dataset.

artificial intelligence, attractor, machine learning, (17 more...)

arXiv.org Artificial Intelligence

doi: 10.1109/ICASSP48485.2024.10446213

2506.05593

Genre: Research Report (0.70)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.47)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

The realization of tones in spontaneous spoken Taiwan Mandarin: a corpus-based survey and theory-driven computational modeling

Lu, Yuxin, Chuang, Yu-Ying, Baayen, R. Harald

arXiv.org Artificial Intelligence, Mar-29-2025

A growing body of literature has demonstrated that semantics can co-determine fine phonetic detail. However, the complex interplay between phonetic realization and semantics remains understudied, particularly in pitch realization. The current study investigates the tonal realization of Mandarin disyllabic words with all 20 possible combinations of two tones, as found in a corpus of Taiwan Mandarin spontaneous speech. We made use of Generalized Additive Mixed Models (GAMs) to model f0 contours as a function of a series of predictors, including gender, tonal context, tone pattern, speech rate, word position, bigram probability, speaker and word. In the GAM analysis, word and sense emerged as crucial predictors of f0 contours, with effect sizes that exceed those of tone pattern. For each word token in our dataset, we then obtained a contextualized embedding by applying the GPT-2 large language model to the context of that token in the corpus. We show that the pitch contours of word tokens can be predicted to a considerable extent from these contextualized embeddings, which approximate token-specific meanings in contexts of use. The results of our corpus study show that meaning in context and phonetic realization are far more entangled than standard linguistic theory predicts.

large language model, machine learning, tone pattern, (20 more...)

arXiv.org Artificial Intelligence

2503.23163

Country:

Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
Europe > Germany > Baden-Württemberg > Tübingen Region > Tübingen (0.04)
Oceania > New Zealand (0.04)
(9 more...)

Genre: Research Report > New Finding (1.00)

Industry: Education > Educational Setting (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.86)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.54)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Decoding the Flow: CauseMotion for Emotional Causality Analysis in Long-form Conversations

Zhang, Yuxuan, Li, Yulong, Yu, Zichen, Tang, Feilong, Lu, Zhixiang, Li, Chong, Dang, Kang, Su, Jionglong

arXiv.org Artificial Intelligence, Jan-1-2025

Long-sequence causal reasoning seeks to uncover causal relationships within extended time series data but is hindered by complex dependencies and the challenges of validating causal links. To address the limitations of large-scale language models (e.g., GPT-4) in capturing intricate emotional causality within extended dialogues, we propose CauseMotion, a long-sequence emotional causal reasoning framework grounded in Retrieval-Augmented Generation (RAG) and multimodal fusion. Unlike conventional methods relying only on textual information, CauseMotion enriches semantic representations by incorporating audio-derived features (vocal emotion, emotional intensity, and speech rate) into textual modalities. By integrating RAG with a sliding window mechanism, it effectively retrieves and leverages contextually relevant dialogue segments, thus enabling the inference of complex emotional causal chains spanning multiple conversational turns. To evaluate its effectiveness, we constructed the first benchmark dataset dedicated to long-sequence emotional causal reasoning, featuring dialogues with over 70 turns. Experimental results demonstrate that the proposed RAG-based multimodal integrated approach substantially enhances both the depth of emotional understanding and the causal inference capabilities of large-scale language models. GLM-4 integrated with CauseMotion achieves an 8.7% improvement in causal accuracy over the original model and surpasses GPT-4o by 1.2%. Additionally, on the publicly available DiaASQ dataset, CauseMotion-GLM-4 achieves state-of-the-art results in accuracy, F1 score, and causal reasoning accuracy.

large language model, machine learning, natural language, (19 more...)

arXiv.org Artificial Intelligence

2501.00778

Country: Asia (0.68)

Genre: Research Report (1.00)

Industry: Health & Medicine > Therapeutic Area (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Analysis of Speech Temporal Dynamics in the Context of Speaker Verification and Voice Anonymization

Tomashenko, Natalia, Vincent, Emmanuel, Tommasi, Marc

arXiv.org Artificial Intelligence, Dec-22-2024

In this paper, we investigate the impact of speech temporal dynamics in application to automatic speaker verification and speaker voice anonymization tasks. We propose several metrics to perform automatic speaker verification based only on phoneme durations. Experimental results demonstrate that phoneme durations leak some speaker information and can reveal speaker identity from both original and anonymized speech. [...] methods use large-scale pre-trained models for extracting specific attributes and provide better content and privacy preservation than signal processing based methods. The diversity of approaches is illustrated by the VoicePrivacy 2024 Challenge [10], which provided six baseline anonymization systems, namely anonymization using x-vectors and a neural source-filter model [6], [11], signal processing [...] While specific studies have been dedicated to speaker information carried by pitch [5], [6], [8], the impact of speech temporal dynamics on speaker verification and re-identification has been overlooked.
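The idea of comparing speakers using only phoneme durations can be illustrated with a small Python sketch. The abstract does not define the paper's metrics, so everything here is an assumption made for illustration: each speaker is summarized by a mean duration per phoneme, and two speakers are compared by cosine similarity over the phonemes they share. Because durations are all positive, raw cosine scores tend to be high for any pair of speakers, so a practical system would need normalization and a calibrated threshold.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def duration_profile(segments: List[Tuple[str, float]]) -> Dict[str, float]:
    """Mean duration in seconds per phoneme, from (phoneme, duration) pairs."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for phoneme, duration in segments:
        sums[phoneme] += duration
        counts[phoneme] += 1
    return {p: sums[p] / counts[p] for p in sums}


def profile_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two profiles over their shared phonemes.

    Hypothetical scoring choice: returns 0.0 when the profiles share no
    phonemes or when either restricted vector has zero norm.
    """
    shared = sorted(set(a) & set(b))
    if not shared:
        return 0.0
    dot = sum(a[p] * b[p] for p in shared)
    norm_a = math.sqrt(sum(a[p] ** 2 for p in shared))
    norm_b = math.sqrt(sum(b[p] ** 2 for p in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A verification trial would then compare the score for an enrollment profile and a test profile against a threshold chosen on held-out data.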

anonymization, artificial intelligence, speech recognition, (15 more...)

arXiv.org Artificial Intelligence

2412.17164

Country:

Europe > France > Grand Est > Meurthe-et-Moselle > Nancy (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > France > Hauts-de-France > Nord > Lille (0.04)
Asia (0.04)

Genre: Research Report > New Finding (0.34)

Industry: Information Technology > Security & Privacy (0.69)

Technology: Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)

Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation

Tuttösí, Paige, Yeung, H. Henny, Wang, Yue, Wang, Fenqi, Denis, Guillaume, Aucouturier, Jean-Julien, Lim, Angelica

arXiv.org Artificial Intelligence, Jun-8-2024

Acoustic context effects, where surrounding changes in pitch, rate or timbre influence the perception of a sound, are well documented in speech perception, but how they interact with language background remains unclear. Using a reverse-correlation approach, we systematically varied the pitch and speech rate in phrases around different pairs of vowels for second language (L2) speakers of English (/i/-/I/) and French (/u/-/y/), thus reconstructing, in a data-driven manner, the prosodic profiles that bias their perception. Testing English and French speakers (n=25), we showed that vowel perception is in fact influenced by conflicting effects from the surrounding pitch and speech rate: a congruent proximal effect 0.2s pre-target and a distal contrastive effect up to 1s before; and found that L1 and L2 speakers exhibited strikingly similar prosodic profiles in perception. We provide a novel method to investigate acoustic context effects across stimuli, timescales, and acoustic domain.

context effect, perception, speech rate, (15 more...)

arXiv.org Artificial Intelligence

2406.05515

Country:

North America > Canada (0.05)
Europe > France (0.05)
Oceania > Vanuatu (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Cognitive Science (1.00)
Information Technology > Artificial Intelligence > Natural Language (0.71)
Information Technology > Artificial Intelligence > Vision (0.68)
